Merge and Clean Data

This notebook merges preview and campaign data and performs basic cleaning, geographic preprocessing, text preprocessing, and EDA.

1) Merging Raw Data Frames

In the first step, the raw preview and the raw campaign data frame are merged. They mainly contain text-based data, as these are the raw outputs from the scraper.
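As a minimal sketch of this merge step — assuming the two frames share a common key such as the campaign URL (the column names here are hypothetical stand-ins, not the actual scraper output):

```python
import pandas as pd

# Toy stand-ins for the raw scraper outputs; using "url" as the shared
# key and these column names is an assumption for illustration.
preview_df = pd.DataFrame({"url": ["c1", "c2"], "title": ["T1", "T2"]})
campaign_df = pd.DataFrame({"url": ["c1", "c2"], "story": ["S1", "S2"]})

# Inner merge on the shared key combines both sources per campaign.
merged = pd.merge(preview_df, campaign_df, on="url", how="inner")
```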

2) Basic Data Cleaning

The task for the basic data cleaning section is to convert the columns to more practical types and to derive all necessary variables that are needed for further analysis.

To clean the campaign_count column, which indicates how many previous campaigns a founder has launched, all non-numeric characters are removed with a regular expression. The resulting string can then be converted to an integer.
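A minimal sketch of this cleaning step, assuming hypothetical raw strings (the exact scraper output and the handling of missing values are assumptions, not the author's rules):

```python
import pandas as pd

# Hypothetical raw values as the scraper might return them.
raw = pd.Series(["1 Campaign", "12 Campaigns", None])

# Strip everything that is not a digit, then convert to int; empty or
# missing entries are treated as a first campaign (an assumption).
counts = (
    raw.str.replace(r"\D", "", regex=True)
       .replace("", "1")
       .fillna("1")
       .astype(int)
)
```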

Almost all rows in the data are campaigns that have already finished, either successfully or unsuccessfully. The roughly 2,000 campaigns that are still ongoing are removed from the data.

Around 200 campaigns with missing values in the campaign title or description were dropped, as these were outliers where the creator set up an incomplete campaign. This ensures comparability between campaigns and facilitates the derivation of text features in later steps. In contrast, campaigns with missing values in the story are not erroneous but represent stories that consist only of images (Example). Hence, they were kept in the data and imputed with an empty string.
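The drop-versus-impute logic described above can be sketched on a toy frame (the column names follow the text; the sample values are hypothetical):

```python
import pandas as pd

# Toy frame with the three text columns (hypothetical values).
df = pd.DataFrame({
    "title": ["A", None, "C"],
    "description": ["d1", "d2", None],
    "story": [None, "s2", "s3"],
})

# Drop incomplete campaigns (missing title or description) ...
df = df.dropna(subset=["title", "description"])
# ... but keep image-only stories, imputing an empty string.
df["story"] = df["story"].fillna("")
```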

The amt_raised column is used to extract several key features: the funding amount, the currency, and the percentage of the funding goal that was reached.

All funding amounts were converted to dollar equivalents using exchange rates from Yahoo Finance.

Now, the original funding goal and a binary success variable can be derived.
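A minimal sketch of this derivation, assuming a raw amount string in a format like "$1,500 USD" (the exact format, the regex, and the variable names are assumptions; the Yahoo Finance currency conversion is not shown):

```python
import re

def parse_amount(s):
    """Split a string like '$1,500 USD' into (value, currency) — assumed format."""
    m = re.match(r"\D*([\d,.]+)\s*([A-Z]{3})?", s)
    value = float(m.group(1).replace(",", ""))
    currency = m.group(2) if m.group(2) else "USD"  # default is an assumption
    return value, currency

amount, currency = parse_amount("$1,500 USD")

pct = 0.75                 # share of the goal reached, parsed analogously
goal = amount / pct        # back out the original funding goal
success = int(pct >= 1.0)  # binary success indicator
```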

The pie chart below shows the skewed distribution of funding success, with only about 20% of campaigns achieving their funding goal.

Infer the "supercategory" or "sector" from the category that is present in the data.
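This inference amounts to a dictionary lookup; a minimal sketch with an illustrative subset of the mapping (the specific category and sector labels here are assumptions):

```python
import pandas as pd

# Illustrative subset of the category -> sector mapping (hypothetical labels).
sector_map = {
    "Dance & Theater": "Creative Works",
    "Tabletop Games": "Creative Works",
    "Camera Gear": "Tech & Innovation",
}

df = pd.DataFrame({"category": ["Tabletop Games", "Camera Gear"]})
df["sector"] = df["category"].map(sector_map)
```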

Rename columns for more consistent naming and intuitive plotting later on.

3) Geoencode and cluster geographic data

Cleaning and geocoding the address column required multiple steps. The first step exploits the uniform formatting of the column and splits each address into its city and country part at the comma.

I am mainly interested in US campaigns, as they constitute the largest part of the data. However, many campaigns list a US state as their country, in a format like "New York, NY". To also classify these campaigns as US campaigns, a dictionary of all US states and their abbreviations is loaded. The abbreviations are augmented with punctuation-free variants, so that both "N.Y." and "NY" are captured as "NY". All states and abbreviations are then converted to "United States" to match the other campaigns from the US.
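The splitting and state normalization can be sketched as follows, assuming the uniform "City, Country" format described above (the dictionary here is an illustrative subset, not the full state list):

```python
# Illustrative subset of the full states-and-abbreviations dictionary.
US_STATES = {"NY": "United States", "New York": "United States",
             "CA": "United States"}

def split_address(addr):
    """Split 'City, Country' on the last comma and normalize US states."""
    city, country = [part.strip() for part in addr.rsplit(",", 1)]
    country = country.replace(".", "")  # "N.Y." and "NY" both become "NY"
    return city, US_STATES.get(country, country)
```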

The following plot visualizes the dominance of the US in the data.

To create a homogeneous data set, the subsequent analysis focuses only on campaigns from the US.

The next part is concerned with geocoding the addresses in order to obtain a longitude and latitude for each campaign. For this, the Google Maps API is used, which converts an address string into a JSON response with all the necessary information. However, the API is a paid service with only a limited number of free calls, so calling it once per data point on every run of the script should be avoided. Instead, I build a dictionary with an entry for each unique address that can be saved locally. This also ensures reproducibility, as the private API key is not needed to rerun the script.

The following code can either create a new geo dictionary or load an existing one from the data folder. Creating a new dictionary, however, will fail without supplying the necessary API key in the function above.

By mapping the finished dictionary onto the address column, a new column of Google geo objects is created, from which all necessary information can be extracted with simple lambda functions.

With these coordinates, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is used to segment the data into multiple high-density clusters and one low-density cluster. From this, a new "urban" variable is constructed that takes the value 1 for each point belonging to a high-density cluster and 0 otherwise.
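A minimal sketch of this step on toy coordinates (the `eps` and `min_samples` values here are illustrative, not the tuned parameters used on the real data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three nearby points (a dense "urban" cluster) and one remote point.
coords = np.array([
    [40.70, -74.00],
    [40.71, -74.01],
    [40.72, -73.99],
    [45.00, -100.00],
])

# eps/min_samples are illustrative; DBSCAN labels noise points -1.
labels = DBSCAN(eps=0.1, min_samples=2).fit_predict(coords)
urban = (labels != -1).astype(int)  # 1 = dense cluster, 0 = noise
```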

In the last step of the geographic preprocessing, state-level economic data is merged into the data frame.

4) Text preprocessing

Preprocessing the text data is a relatively straightforward process, as the procedure is the same for all text columns, i.e., the title, description, and story. It starts with basic text cleaning, i.e., removing stop words, special characters, and numbers, and converting all words to lower case, as specified in the function below.
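A minimal sketch of such a cleaning function (the stop-word set here is an illustrative subset; in practice a full list, e.g. NLTK's, would be used):

```python
import re

# Illustrative subset of a stop-word list (assumption, not the full list).
STOP_WORDS = {"the", "a", "is", "of", "and"}

def clean_text(text):
    """Lower-case, drop special characters and numbers, remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # non-letters become spaces
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(tokens)
```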

Then the TextBlob library is used to calculate a polarity and a subjectivity score. The former measures the sentiment of a text, whereas the latter quantifies how emotionally a text is written. The library employs a dictionary-based approach: each word is matched against a lexicon of words annotated with positive/negative sentiment and high/low emotion, and the final score is the average over all words in the evaluated text. Lastly, the word count of each title, description, and story is calculated.
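The dictionary-based averaging that TextBlob performs can be illustrated with a toy lexicon (the words and scores below are made up for illustration; TextBlob's actual lexicon is far larger):

```python
# Toy lexicon: (polarity, subjectivity) per word — illustrative values only.
LEXICON = {
    "great": (0.8, 0.75),
    "terrible": (-1.0, 1.0),
    "camera": (0.0, 0.0),
}

def score(text):
    """Average the lexicon scores of all matched words in the text."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return 0.0, 0.0
    polarity = sum(h[0] for h in hits) / len(hits)
    subjectivity = sum(h[1] for h in hits) / len(hits)
    return polarity, subjectivity

# The third text feature is just the word count.
word_count = len("a great camera".split())
```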

With the text processing, the data derivation and cleaning part is concluded and the final data is saved as a pickle file.

5) EDA

This section conducts a broad exploratory data analysis that was used to familiarize myself with the data and potentially uncover interesting patterns. Not every graphic is directly aimed at answering the final research question. Nonetheless, I left many plots in the notebook to show my thinking as I looked at the data from different angles during the first investigation.

First, the summary statistics of each variable can be inspected with the describe function. For example, the average campaign raises $17,567 but succeeds with only a 20% likelihood, and uses 5 images, no videos, and about 700 words to describe itself.

These graphs show how different variables are distributed for successful vs. unsuccessful campaigns. The biggest differences are, as expected, in the funding amount that was raised. Unsuccessful campaigns also have a higher funding goal on average, use fewer images and videos, and have shorter stories.

The following figure shows the number of campaigns in each sector and category. One can quickly see that the creative works sector has the most campaigns and that the most popular category is dance and theater. Overall, the popularity of the different categories varies widely.

This plot is a nice visualization of market size and funding success per category and sector. Both depend strongly on the sector and category; the technology sector in particular raises the most funds. Average success rates are more uniform across categories than market size.

The following graph looks at potential patterns of other variables across categories. Funding amounts and success rates differ substantially across categories; however, the largest markets are not necessarily those with the highest success rate. Tabletop games and comics stand out as the categories where fundraisers launch the most campaigns on average. Audio and Camera Gear have the highest number of images and videos used in the stories.

The following plots show different variables by US state. Most campaigns are launched in CA and NY, which also have the largest venture capital investments. DE and UT have the highest success rates; however, this might not be a robust result due to the low number of campaigns in these states.

This plot shows the distribution of the number of campaigns a fundraiser typically launches, as well as the average success likelihood by number of campaigns. It seems that the more often a creator launches a crowdfunding project, the more successful it becomes. This might, however, be due to survivorship bias.

The following shows the distribution of funds raised as a function of the three text metrics for the title, description, and story. Funds raised appear roughly bell-shaped in sentiment, emotionality, and length, suggesting that extreme values do not do well, whereas moderate scores benefit the campaign.

This plots a word cloud for every category, giving insight into the different topics in each category and shedding light on the quality of the cleaned texts.